
Conversation

@mcowger
Contributor

@mcowger mcowger commented Aug 27, 2025

Context

Forcibly setting this value can force the Ollama server to perform an unexpected reload if its configured context differs from our built-in defaults. Removing it ensures the new native-ollama behavior is the same as the previous behavior.

A followup PR will be added to allow this to be overridden in the UI.

Fixes: #2060

Thanks to jebba7151 for finding the root cause.

Implementation

Remove num_ctx: modelInfo.contextWindow from the client.chat() call.
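
In the ollama JS client this boils down to no longer passing num_ctx in the request options. A minimal before/after sketch (illustrative names, not the exact Kilo Code source):

import { Ollama, type Message } from "ollama"

const client = new Ollama({ host: "http://localhost:11434" })

// Before: num_ctx was sent on every request, so Ollama reloaded the model whenever
// the value differed from the context it was already loaded with.
async function chatBefore(model: string, messages: Message[], contextWindow: number) {
    return client.chat({ model, messages, options: { num_ctx: contextWindow } })
}

// After: num_ctx is omitted, so the server keeps whatever context it was configured
// with (Modelfile PARAMETER num_ctx, OLLAMA_CONTEXT_LENGTH, or its default).
async function chatAfter(model: string, messages: Message[]) {
    return client.chat({ model, messages })
}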

Screenshots

NA

How to Test

  1. Set up an Ollama connection with a model that has a 16K context configured but defaults to > 16K in Kilo (see the example Modelfile below).
  2. Send a completion request.
  3. Ensure the model does not reload due to a config change.
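
For step 1, one way to get a model with a 16K context is a Modelfile along these lines (qwen3:0.6b is just an example base; any model whose Kilo default exceeds 16K works):

FROM qwen3:0.6b
PARAMETER num_ctx 16384

Register it with ollama create <some-name> -f Modelfile and point Kilo at the resulting model.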

Get in Touch

mcowger on Discord.

@changeset-bot

changeset-bot bot commented Aug 27, 2025

🦋 Changeset detected

Latest commit: d0efa75

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
  Name        Type
  kilo-code   Patch


@chrarnoldus
Collaborator

This change causes Ollama to truncate prompts at 4096 tokens, which breaks Kilo Code completely.

time=2025-08-27T22:31:30.253+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=4096 prompt=9089 keep=4 new=4096

@mcowger
Contributor Author

mcowger commented Aug 28, 2025

This change causes Ollama to truncate prompts at 4096 tokens, which breaks Kilo Code completely.

time=2025-08-27T22:31:30.253+02:00 level=WARN source=runner.go:128 msg="truncating input prompt" limit=4096 prompt=9089 keep=4 new=4096

So I don't think it does, really.

Many Ollama models default to 4k even if they support more (qwen3-0.6b is my test example).

So if I run with this patch against a default-configured model like qwen3-0.6b, I get the same output because it's configured for 4k:

❯ ollama show qwen3:0.6b
  Model
    architecture        qwen3
    parameters          751.63M
    context length      40960
    embedding length    1024
    quantization        Q4_K_M

But if I create a Modelfile that pushes that up (or if the model natively supports a longer context):

FROM qwen3:0.6b
PARAMETER temperature 0.6
PARAMETER num_ctx 32768
PARAMETER top_k 20

It handles the request just fine.

So I don't think this change breaks Kilo with Ollama; it just prevents Kilo from operating with models that have unacceptably small defaults.

FWIW, we'll also inherit this from Roo anyways: RooCodeInc/Roo-Code#7454

@chrarnoldus
Collaborator

We do read the num_ctx from the model parameters:

? parseInt(rawModel.parameters.match(/^num_ctx\s+(\d+)/m)?.[1] ?? "", 10) || undefined
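
Spelled out, that fragment sits in an expression roughly like this (a sketch of the surrounding code, not the exact source):

const contextWindow = rawModel.parameters
    ? parseInt(rawModel.parameters.match(/^num_ctx\s+(\d+)/m)?.[1] ?? "", 10) || undefined
    : undefined

i.e. an explicit PARAMETER num_ctx from the model's parameter dump is used when present; otherwise the value is left undefined.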

I just tested this in the release version with devstral with num_ctx set to 32k and it seems to work, so I'm not sure what problem you're trying to solve:

llama_context: n_ctx_per_seq (32000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized

@mcowger
Contributor Author

mcowger commented Aug 28, 2025

We do read the num_ctx from the model parameters:

? parseInt(rawModel.parameters.match(/^num_ctx\s+(\d+)/m)?.[1] ?? "", 10) || undefined

I just tested this in the release version with devstral with num_ctx set to 32k and it seems to work, so I'm not sure what problem you're trying to solve:

llama_context: n_ctx_per_seq (32000) < n_ctx_train (131072) -- the full capacity of the model will not be utilized

The issue is the defaults, and Ollama's behavior when we specify a different value.

  1. The model is loaded with a context window value of 32768 (either as a default or manually overridden by the user in a Modelfile or via a CLI option).
  2. The user configures the model in Kilo. We explicitly send a num_ctx value, which is not correctly calculated (because Ollama reports different values in different places). If we send, say, "40960", this forces Ollama to reload the model with the new value, even if that value is incompatible with the user's hardware (e.g. it won't fit into VRAM), killing performance (and also causing an expensive model reload).

I've just pushed a new commit that solves this a little better. When the handler is initialized, we interrogate the model info more thoroughly and use it to estimate whether the completion request will fit. If not, we throw an error rather than setting num_ctx and forcing a model reload.
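
Roughly, the idea is (a sketch with illustrative names, not the exact code in the commit):

import { Ollama } from "ollama"

// Read an explicit `PARAMETER num_ctx` from the model's `ollama show` output, if any.
async function getConfiguredContext(client: Ollama, model: string): Promise<number | undefined> {
    const info = await client.show({ model })
    return parseInt(info.parameters?.match(/^num_ctx\s+(\d+)/m)?.[1] ?? "", 10) || undefined
}

// Fail fast with a clear error instead of sending num_ctx and forcing a reload.
function assertPromptFits(promptTokens: number, configuredContext: number | undefined) {
    const limit = configuredContext ?? 4096 // assume Ollama's usual default when nothing is configured
    if (promptTokens > limit) {
        throw new Error(
            `Prompt is ~${promptTokens} tokens but the model's context is ${limit}. ` +
                `Raise num_ctx in the Modelfile or OLLAMA_CONTEXT_LENGTH on the server.`,
        )
    }
}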

Take a look and let me know your thoughts.

@chrarnoldus
Collaborator

Your latest commit works if you manually set PARAMETER num_ctx, but if you don't do that you still get a truncated prompt (and an infinite loop because of it). I don't think that's acceptable as a default experience.

@mcowger
Contributor Author

mcowger commented Aug 28, 2025

Your latest commit works if you manually set PARAMETER num_ctx, but if you don't do that you still get a truncated prompt (and an infinite loop because of it). I don't think that's acceptable as a default experience.

OK. I think it's a tough spot:

  1. The current behavior causes Ollama to reload and perform poorly or crash, and it ignores parameters the user explicitly chose on the Ollama side. It's invisible: it isn't clear what's happening or why the performance / reliability changes.
  2. This solution (which is already incoming from Roo anyway, though my update today makes it a little better) results in a clear error about what went wrong and why.

I'll defer to the Kilo team on your preference.

@chrarnoldus
Collaborator

We would love to work on a solution together; it is clear the community is not happy with the current Ollama performance. But there does need to be a solution to the prompt truncation problem, because that is the first thing a new user will see, and there is no clear error message in that case.

A followup PR will be added to allow this to be overridden in the UI.

How about implementing this proposal? I tried to implement it before (#1975), but had issues getting the value to sync properly on change (probably nothing insurmountable).

FWIW, we'll also inherit this from Roo anyways: RooCodeInc/Roo-Code#7454

That is a bot-generated PR, so there is no guarantee of quality.

@mcowger
Contributor Author

mcowger commented Aug 29, 2025

there is no clear error message in that case

There is. With my updated commit, any request that exceeds the model's reported or expected limit triggers an explicit error saying the context is too long.

But I'll leave this to the Kilo team to solve in a way that works for you.

@chrarnoldus
Collaborator

There is. With my updated commit, any request that exceeds the model's reported or expected limit triggers an explicit error saying the context is too long.

I tested your latest commit, but couldn't get the error to show up for a vanilla model. I added a commit that I think fixes it.

But I'll leave this to the Kilo team to solve in a way that works for you.

We don't use Ollama regularly so your input is very valuable. Please let me know what you think!

Comment on lines +108 to +114
## Preventing prompt truncation

By default Ollama truncates prompts to a very short length.
If you run into this problem, please see this FAQ item to resolve it:
[How can I specify the context window size?](https://github.com/ollama/ollama/blob/4383a3ab7a075eff78b31f7dc84c747e2fcd22b8/docs/faq.md#how-can-i-specify-the-context-window-size)

If you decide to use the `OLLAMA_CONTEXT_LENGTH` environment variable, it needs to be visible to both the IDE and the Ollama server.
Collaborator


This is the real change in this file; the rest is forced autoformat.

@mcowger
Contributor Author

mcowger commented Aug 29, 2025

Nice, I like the use of the ENV var.

@chrarnoldus chrarnoldus merged commit c509f12 into Kilo-Org:main Aug 29, 2025
11 checks passed
@chrarnoldus
Collaborator

Thanks for your contribution @mcowger

@mcowger mcowger deleted the mcowger/ollamaContext branch October 27, 2025 18:29
suissa pushed a commit to suissa/neurohive-kilocode that referenced this pull request Oct 29, 2025
Remove the forced override of the context limit for Ollama API Kilo-Org#2060


Development

Successfully merging this pull request may close these issues.

Local models Ollama provider using CPU instead of GPU

3 participants